Compression of Semistructured Documents

Authors

  • Leo Galambos
  • Jan Lansky
  • Katsiaryna Chernik

Abstract

EGOTHOR is a search engine that indexes the Web and lets users search Web documents. Its hit list contains the URL and title of each hit, along with a snippet that briefly shows the match. The snippet can almost always be assembled by an algorithm that has full knowledge of the original document (usually an HTML page). This implies that the search engine must store the full text of the documents as part of the index. That requirement calls for an appropriate compression algorithm to reduce the space demand. One solution is to use common compression methods, for instance gzip or bzip2, but it may be preferable to develop a new method that takes advantage of the document structure, or rather, the textual character of the documents. Special text compression algorithms and methods for compressing XML documents already exist. The aim of this paper is to integrate the two approaches to achieve an optimal compression ratio.

Keywords: Compression, search engine, HTML, XML.
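The general-purpose baseline mentioned in the abstract (gzip or bzip2 applied to the stored page) can be sketched as follows. This is only an illustrative measurement using Python's standard `gzip` and `bz2` modules, not the method proposed in the paper; the sample page and the helper name `compressed_sizes` are assumptions made up for the example.

```python
import bz2
import gzip


def compressed_sizes(document: bytes) -> dict[str, int]:
    """Return the size in bytes of the document, raw and under each baseline codec."""
    return {
        "raw": len(document),
        "gzip": len(gzip.compress(document)),
        "bzip2": len(bz2.compress(document)),
    }


# Hypothetical sample page (an assumption for illustration, not data from the paper).
SAMPLE_PAGE = (
    "<html><head><title>Example</title></head><body>"
    + "<p>Search engines store page text to build snippets.</p>" * 200
    + "</body></html>"
).encode("utf-8")

if __name__ == "__main__":
    for name, size in compressed_sizes(SAMPLE_PAGE).items():
        print(f"{name}: {size} bytes")
```

Both codecs are lossless, so the original page can be restored for snippet generation. Neither one uses any knowledge of the markup, which is the gap a structure-aware method would try to exploit.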

Similar resources

Semantic Lossy Compression of XML Data

In recent years a large amount of semistructured data [1, 10] has been managed and exchanged. The largest repository of semistructured data is the World Wide Web, which can be thought of as an enormous database in which data is highly heterogeneous and freely correlated. Extensible Markup Language (XML) [14] belongs in this scenario: it is a language for semistructured data standardised by the Wo...

Semistructured Data Store Mapping with XML and Its Reconstruction

XML has been quickly emerging as a dominant standard for data representation and exchange on the World Wide Web for its many good features such as well-formed structure or semantic support. Research on semistructured data over the last several years has focused on data models, query languages, and systems where the database is modeled in some form of a labeled, directed graph. Processing this a...

Using structural contexts to compress semistructured text collections

We describe a compression model for semistructured documents, called Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate model to compress the text that lies inside each different structure type (e.g., different XML tag). The intuition behind SCM is that the distribution of all the texts t...

Instance-Independent View Serializability for Semistructured Databases

Semistructured databases require tailor-made concurrency control mechanisms since traditional solutions for the relational model have been shown to be inadequate. Such mechanisms need to take full advantage of the hierarchical structure of semistructured data, for instance allowing concurrent updates of subtrees of, or even individual elements in, XML documents. We present an approach for concu...

Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents

Many Web documents such as HTML files and XML files have no rigid structure and are called semistructured data. In general, such semistructured Web documents are represented by rooted trees with ordered children. We propose a new method for discovering frequent tree structured patterns in semistructured Web documents by using a tag tree pattern as a hypothesis. A tag tree pattern is an edge lab...

Publication date: 2006